K-Means Presentation

An Introduction to K-means clustering

Finding meaningful patterns within data has become increasingly challenging as data collection and management continue to grow at an unprecedented rate.

K-means clustering is well suited to such large volumes of unlabeled data.

We will employ the k-means clustering algorithm to gain insights into customer segmentation using eCommerce data from a retail store based in the UK.

By the end of this presentation we will have discussed the following concepts of k-means clustering:

  • Unsupervised learning
  • Clustering
  • The mathematical foundation of the k-means algorithm via the Euclidean distance

Data Preparation

  • Data from an online retailer based in the UK
  • 406,829 observations of customer transactions

Raw Customer Data


Cleaned Data

  • Nulls removed
  • Subset the data to remove UK sales
  • Removed quantities \(<0\) (returns)
  • Created variables \(Sales\), \(Orders\), and \(AvgSale\) to capture customer spending habits
  • Aggregated the data by customer
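As a sketch of these steps, assuming the raw data follows the typical online-retail layout (columns `InvoiceNo`, `Quantity`, `UnitPrice`, `CustomerID`), the cleaning and per-customer aggregation might look like the following; the toy rows and the dplyr pipeline are illustrative, not the project's actual preparation code:

```{r}
library(dplyr)

# Toy stand-in for the raw transactions (hypothetical values)
df <- data.frame(
  InvoiceNo  = c("536365", "536365", "536366", "536367"),
  Quantity   = c(6, -1, 2, 8),
  UnitPrice  = c(2.55, 2.55, 1.85, 4.25),
  CustomerID = c(17850, 17850, 17850, NA)
)

df_cust <- df %>%
  filter(!is.na(CustomerID), Quantity > 0) %>%   # drop nulls and returns
  mutate(LineSale = Quantity * UnitPrice) %>%    # revenue per line item
  group_by(CustomerID) %>%
  summarise(Sales   = sum(LineSale),             # total spend
            Orders  = n_distinct(InvoiceNo),     # distinct invoices
            AvgSale = Sales / Orders)            # spend per order
```

The same pattern scales directly to the full 406,829-row data set.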

Data Distribution

Data favors UK

Data distribution by CustomerID

Data spread for Orders

An Introduction to K-means clustering

There are five main steps to the k-means clustering algorithm:

  • Choose the number of clusters, \(k\).
  • Set \(k\) random seed points.
  • Assign each data point to its nearest seed point by distance.
  • Compute the mean center of each resulting cluster, called a centroid.
  • Repeat the assignment and centroid-update steps until the centroids no longer change.
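The steps above can be sketched directly in base R on toy two-dimensional data; the data, convergence tolerance, and loop structure here are illustrative assumptions, not the clustering code used later in the analysis:

```{r}
set.seed(42)
# Two well-separated toy groups of 10 points each
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 5), ncol = 2))
k <- 2

# Steps 1-2: choose k and pick k random seed points from the data
centroids <- x[sample(nrow(x), k), , drop = FALSE]

repeat {
  # Step 3: assign each point to its nearest centroid (Euclidean distance)
  d <- as.matrix(dist(rbind(centroids, x)))[-(1:k), 1:k]
  cluster <- apply(d, 1, which.min)

  # Step 4: recompute each centroid as its cluster's mean center
  new_centroids <- do.call(rbind, lapply(1:k, function(i)
    colMeans(x[cluster == i, , drop = FALSE])))

  # Step 5: stop once the centroids no longer move
  if (all(abs(new_centroids - centroids) < 1e-8)) break
  centroids <- new_centroids
}
```

In practice R's built-in `kmeans()` performs this loop (with better initialization options), which is what the later chunks use.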

What is Unsupervised Learning?

  • A statistical modeling technique used to categorize/group data without the assistance of labeled outcomes or human intervention.
  • Autonomous data categorization
  • Creates a data driven outcome

Clustering

  • Clustering is the act of partitioning data into meaningful groups based on similarity in attributes.

  • The goal of clustering is to create insightful clusters to better understand connections in the data.

Clustering

  • \(k=4\): the seed points are green, forming four clusters
  • Once the initial centroids are assigned, the Euclidean distance is used to establish cluster members by minimal distance, shown as the red lines in the figure.

centroids

Euclidean Distance

  • Euclidean distance is used to find dissimilarity between data points in order to group similar data instances into clusters
  • In k-means, this is done by calculating the distance between a data object \(x\) and a cluster center \(C_i\):

\[ d(x,C_i)=\sqrt{\sum_{j=1}^{N} (x_j - C_{ij})^2} \]
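For a single point and center, the formula reduces to one line of R. The numbers below are made up for illustration; base R's `dist()` gives the same value:

```{r}
# Euclidean distance between a data point x and a cluster center Ci (toy values)
x  <- c(1, 2, 3)    # data object with N = 3 features
Ci <- c(4, 6, 3)    # cluster center
d  <- sqrt(sum((x - Ci)^2))   # sqrt(9 + 16 + 0) = 5

# base R's dist() agrees
all.equal(d, as.numeric(dist(rbind(x, Ci))))
```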

Objective Function:

  • The k-means clustering objective is to minimize the within-cluster variance.

It is formulated as:

\[ J = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2 \]

Euclidean Distance

  • \(k\) is the number of clusters.

  • \(C_i\) represents the set of points in cluster \(i\)

  • \(\mu_i\) represents the centroid mean of cluster \(i\)

  • In this context, similarity is inversely related to the Euclidean distance

  • The smaller the distance, the greater the similarity between objects
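To make the objective concrete, the within-cluster sum of squares can be computed by hand from the formula above and checked against what `kmeans()` reports. The toy data here is an assumption; the point is only that the hand computation and `tot.withinss` agree:

```{r}
set.seed(1)
x <- matrix(rnorm(60), ncol = 3)           # toy data, 20 points x 3 features
fit <- kmeans(x, centers = 2, nstart = 10)

# Sum over clusters of squared distances to each cluster's centroid mean
wcss_by_hand <- sum(sapply(1:2, function(i) {
  pts <- x[fit$cluster == i, , drop = FALSE]
  mu  <- fit$centers[i, ]                  # centroid mean of cluster i
  sum(rowSums(sweep(pts, 2, mu)^2))        # squared Euclidean distances
}))

all.equal(wcss_by_hand, fit$tot.withinss)
```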

New Centroids

  • K-means clustering reassigns the data points to each cluster based on the Euclidean Distance calculation.

  • A new centroid location is set by updating its position to each cluster's mean center.

data points reassigned to centroids

Determine K

  • The elbow method plots the within-cluster sum of squares (WCSS) against \(k\); the optimal \(k\) sits at the bend, where adding clusters stops reducing WCSS substantially.
```{r}
# Create an empty vector to store WCSS values
wcss <- vector("numeric", length = 10)
# Iterate over K values from 1 to 10, refitting k-means each time
set.seed(100)  # kmeans uses random starts, so fix the seed
for (i in 1:10) {
  model <- kmeans(df_norm[c("Sales", "Orders", "AvgSale")], centers = i, nstart = 10)
  wcss[i] <- model$tot.withinss  # total within-cluster sum of squares
}
```

Elbow Method

```{r}
library(ggplot2)
# Plot the WCSS values against the number of clusters
p1 <- ggplot(data.frame(K = 1:10, WCSS = wcss), aes(x = K, y = WCSS)) +
  geom_line() +
  geom_point() +
  labs(title = "Elbow Method to Find Optimal K",
       x = "Number of Clusters (K)",
       y = "Within-Cluster-Sum-of-Squares (WCSS)") +
  scale_x_continuous(breaks = seq(0, 10, by = 1))
p1
```

Clustering Model & Results:

```{r}
set.seed(100)
model1 <- kmeans(df_norm[c("Sales", "Orders", "AvgSale")], centers = 4)
```
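The pieces that drive the interpretation on the following slides are the cluster sizes and centers. Since `df_norm` is not reproduced here, this sketch uses toy standardized data with the same column names:

```{r}
set.seed(100)
toy <- data.frame(Sales = rnorm(50), Orders = rnorm(50), AvgSale = rnorm(50))
fit <- kmeans(toy, centers = 4, nstart = 10)

fit$size      # number of customers in each cluster
fit$centers   # each cluster's mean Sales / Orders / AvgSale profile
```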

Cluster Results

Cluster Interpretation

The goal is to evaluate the clusters and find meaning in the results. We can connect purchasing frequency and amount to marketing and pricing strategies that promote customer satisfaction.

  • Cluster 1: potentially newer customers; a large target for marketing.

  • Cluster 2: a target for marketing a select range of costlier items.

  • Cluster 3: likely frequent customers; may benefit from low- to mid-priced item recommendations to increase or maintain engagement.

  • Cluster 4: lowest return on investment and fewest interactions; the lowest priority.

Future Works

Over the past sixty years, improvements and additions have been made to the base k-means methodology to optimize performance and increase scalability.

  • This data may benefit from exploring alternative distance formulas, such as the Manhattan and Chebyshev distances.

  • Additional \(k\)-selection methods, such as the Empirical and Silhouette methods, and the use of principal component analysis or other dimensionality-reduction techniques could improve efficiency and overall clustering results.